We use the Weekly data set, which contains weekly percentage returns for the S&P 500 stock index between 1990 and 2010.
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202
## Median : 0.2380 Median : 0.2340 Median :1.00268
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821
## Today Direction
## Min. :-18.1950 Down:484
## 1st Qu.: -1.1540 Up :605
## Median : 0.2410
## Mean : 0.1499
## 3rd Qu.: 1.4050
## Max. : 12.0260
The data set contains 1089 observations on the following 9 variables.
library(psych)
# Scatterplot matrix with histograms and Pearson correlations
pairs.panels(Weekly[,-9],
method = "pearson",
hist.col = "#00AFBB",
density = TRUE
)

In general the variables do not appear strongly correlated, but there is a clear positive correlation between Year and Volume. Volume is the average number of daily shares traded (in billions), so trading volume grows as time passes.
We fit a logistic regression on the full data set, with Direction as the response and the five lag variables plus Volume as predictors.
weeklyrlo <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = "binomial")
summary(weeklyrlo)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = "binomial", data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
We can see that Lag2 is the only significant predictor.
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
predichos=as.factor(ifelse(test = weeklyrlo$fitted.values > 0.5, yes = "Up", no = "Down"))
verdaderos=as.factor(Weekly$Direction)
matriz<-confusionMatrix(predichos, verdaderos)
matriz
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 54 48
## Up 430 557
##
## Accuracy : 0.5611
## 95% CI : (0.531, 0.5908)
## No Information Rate : 0.5556
## P-Value [Acc > NIR] : 0.369
##
## Kappa : 0.035
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.11157
## Specificity : 0.92066
## Pos Pred Value : 0.52941
## Neg Pred Value : 0.56434
## Prevalence : 0.44444
## Detection Rate : 0.04959
## Detection Prevalence : 0.09366
## Balanced Accuracy : 0.51612
##
## 'Positive' Class : Down
##
| Predicted \ Reference | Down | Up |
|---|---|---|
| Down | A | B |
| Up | C | D |
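The indicator `w_entrenamiento` used below to select the training rows is never defined in this excerpt. A plausible definition, assuming the usual 1990–2008 training / 2009–2010 test split for this exercise (and that `Weekly` is attached so `Lag2` and `Direction` are visible later), is:

```r
# Assumption: train on 1990-2008, test on 2009-2010
# (matching the Weekly.20092010 subset used below)
attach(Weekly)
w_entrenamiento <- (Year < 2009)
```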
weeklyrlo2 <- glm(Direction ~ Lag2, data = Weekly, family = "binomial", subset = w_entrenamiento)
summary(weeklyrlo2)
##
## Call:
## glm(formula = Direction ~ Lag2, family = "binomial", data = Weekly,
## subset = w_entrenamiento)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.536 -1.264 1.021 1.091 1.368
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
Weekly.20092010 <- Weekly[!w_entrenamiento, ]
Direction.20092010 <- Direction[!w_entrenamiento]
p <- predict(weeklyrlo2, Weekly.20092010, type = "response")
predichos=as.factor(ifelse(test = p > 0.5, yes = "Up", no = "Down"))
verdaderos=as.factor(Direction.20092010)
matriz<-confusionMatrix(predichos, verdaderos)
matriz
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 5
## Up 34 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.2439
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.20930
## Specificity : 0.91803
## Pos Pred Value : 0.64286
## Neg Pred Value : 0.62222
## Prevalence : 0.41346
## Detection Rate : 0.08654
## Detection Prevalence : 0.13462
## Balanced Accuracy : 0.56367
##
## 'Positive' Class : Down
##
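The LDA fit whose output follows is not shown in this excerpt; presumably it was produced along these lines (assuming the MASS package and the same training indicator used earlier):

```r
# Assumption: LDA on Lag2 over the 1990-2008 training period
library(MASS)
ldapre <- lda(Direction ~ Lag2, data = Weekly, subset = w_entrenamiento)
ldapre
```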
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = w_entrenamiento)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
pred.lda <- predict(ldapre, Weekly.20092010)
verdaderos=as.factor(Direction.20092010)
matriz<-confusionMatrix(pred.lda$class, verdaderos)
matriz
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 5
## Up 34 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.2439
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.20930
## Specificity : 0.91803
## Pos Pred Value : 0.64286
## Neg Pred Value : 0.62222
## Prevalence : 0.41346
## Detection Rate : 0.08654
## Detection Prevalence : 0.13462
## Balanced Accuracy : 0.56367
##
## 'Positive' Class : Down
##
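Likewise, the QDA fit whose output follows is not shown; a sketch under the same assumptions:

```r
# Assumption: QDA on Lag2 over the 1990-2008 training period
qdapre <- qda(Direction ~ Lag2, data = Weekly, subset = w_entrenamiento)
qdapre
```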
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = w_entrenamiento)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
pred.qda <- predict(qdapre, Weekly.20092010)
matriz<-confusionMatrix(pred.qda$class, verdaderos)
matriz
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 0 0
## Up 43 61
##
## Accuracy : 0.5865
## 95% CI : (0.4858, 0.6823)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.5419
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.504e-10
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.5865
## Prevalence : 0.4135
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Down
##
library(class)
entrenamiento.x <- as.matrix(Lag2[w_entrenamiento])
prueba.x <- as.matrix(Lag2[!w_entrenamiento])
entrenamiento.Direction <- Direction[w_entrenamiento]
pred.knn <- knn(entrenamiento.x, prueba.x, entrenamiento.Direction, k = 1)
confusionMatrix(pred.knn, Direction.20092010)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 21 29
## Up 22 32
##
## Accuracy : 0.5096
## 95% CI : (0.4097, 0.609)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.9540
##
## Kappa : 0.0127
##
## Mcnemar's Test P-Value : 0.4008
##
## Sensitivity : 0.4884
## Specificity : 0.5246
## Pos Pred Value : 0.4200
## Neg Pred Value : 0.5926
## Prevalence : 0.4135
## Detection Rate : 0.2019
## Detection Prevalence : 0.4808
## Balanced Accuracy : 0.5065
##
## 'Positive' Class : Down
##
| | GLM | LDA | QDA | KNN |
|---|---|---|---|---|
| Accuracy | 62.5 | 62.5 | 58.65 | 50.96 |
| Sensitivity | 20.93 | 20.93 | 0 | 48.84 |
| Specificity | 91.80 | 91.80 | 100 | 52.46 |

We can therefore conclude that the GLM and LDA models gave the best results.
Next we evaluate KNN with K = 10, 20, 50, and 100.
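The "Accuracy: … Sensitivity: … Specificity: …" strings printed below come from a helper (`imprimirASS`, defined outside this excerpt); a minimal sketch, assuming caret's `confusionMatrix` object layout, is:

```r
# Hypothetical reconstruction: print accuracy, sensitivity and
# specificity from a caret::confusionMatrix result
imprimirASS <- function(z) {
  print(paste("Accuracy:", round(z$overall["Accuracy"], 4),
              "Sensitivity:", round(z$byClass["Sensitivity"], 4),
              "Specificity:", round(z$byClass["Specificity"], 4)))
}
```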
# For K = 10
pred.knn <- knn(entrenamiento.x, prueba.x, entrenamiento.Direction, k = 10)
z <- confusionMatrix(pred.knn, Direction.20092010)
z$table
## Reference
## Prediction Down Up
## Down 16 19
## Up 27 42
## [1] "Accuracy: 0.5577 Sensitivity: 0.3721 Specificity: 0.6885"
# For K = 20
pred.knn <- knn(entrenamiento.x, prueba.x, entrenamiento.Direction, k = 20)
z <- confusionMatrix(pred.knn, Direction.20092010)
z$table
## Reference
## Prediction Down Up
## Down 21 21
## Up 22 40
## [1] "Accuracy: 0.5865 Sensitivity: 0.4884 Specificity: 0.6557"
# For K = 50
pred.knn <- knn(entrenamiento.x, prueba.x, entrenamiento.Direction, k = 50)
z <- confusionMatrix(pred.knn, Direction.20092010)
z$table
## Reference
## Prediction Down Up
## Down 21 24
## Up 22 37
## [1] "Accuracy: 0.5577 Sensitivity: 0.4884 Specificity: 0.6066"
# For K = 100
pred.knn <- knn(entrenamiento.x, prueba.x, entrenamiento.Direction, k = 100)
z <- confusionMatrix(pred.knn, Direction.20092010)
z$table
## Reference
## Prediction Down Up
## Down 10 13
## Up 33 48
## [1] "Accuracy: 0.5577 Sensitivity: 0.2326 Specificity: 0.7869"
Now we try logistic regressions with different sets of predictors: Volume; Lag2 + Lag3; Lag1 + Lag2; Lag1 + Lag2 + Volume.
# For Volume
weeklyrlo2 <- glm(Direction ~ Volume, data = Weekly, family = "binomial", subset = w_entrenamiento)
p <- predict(weeklyrlo2, Weekly.20092010, type = "response")
predichos=as.factor(ifelse(test = p > 0.5, yes = "Up", no = "Down"))
verdaderos=as.factor(Direction.20092010)
matriz<-confusionMatrix(predichos, verdaderos)
matriz$table
## Reference
## Prediction Down Up
## Down 31 47
## Up 12 14
## [1] "Accuracy: 0.4327 Sensitivity: 0.7209 Specificity: 0.2295"
# For Lag2 + Lag3
weeklyrlo2 <- glm(Direction ~ Lag2 + Lag3, data = Weekly, family = "binomial", subset = w_entrenamiento)
p <- predict(weeklyrlo2, Weekly.20092010, type = "response")
predichos=as.factor(ifelse(test = p > 0.5, yes = "Up", no = "Down"))
verdaderos=as.factor(Direction.20092010)
matriz<-confusionMatrix(predichos, verdaderos)
matriz$table
## Reference
## Prediction Down Up
## Down 8 4
## Up 35 57
## [1] "Accuracy: 0.625 Sensitivity: 0.186 Specificity: 0.9344"
# For Lag1 + Lag2
weeklyrlo2 <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = "binomial", subset = w_entrenamiento)
p <- predict(weeklyrlo2, Weekly.20092010, type = "response")
predichos=as.factor(ifelse(test = p > 0.5, yes = "Up", no = "Down"))
verdaderos=as.factor(Direction.20092010)
matriz<-confusionMatrix(predichos, verdaderos)
matriz$table
## Reference
## Prediction Down Up
## Down 7 8
## Up 36 53
## [1] "Accuracy: 0.5769 Sensitivity: 0.1628 Specificity: 0.8689"
# For Lag1 + Lag2 + Volume
weeklyrlo2 <- glm(Direction ~ Lag1 + Lag2 + Volume, data = Weekly, family = "binomial", subset = w_entrenamiento)
p <- predict(weeklyrlo2, Weekly.20092010, type = "response")
predichos=as.factor(ifelse(test = p > 0.5, yes = "Up", no = "Down"))
verdaderos=as.factor(Direction.20092010)
matriz<-confusionMatrix(predichos, verdaderos)
matriz$table
## Reference
## Prediction Down Up
## Down 27 33
## Up 16 28
## [1] "Accuracy: 0.5288 Sensitivity: 0.6279 Specificity: 0.459"
Now we try LDA with the same sets of predictors: Volume; Lag2 + Lag3; Lag1 + Lag2; Lag1 + Lag2 + Volume.
# For Volume
ldapre <- lda(Direction ~ Volume, data = Weekly, subset = w_entrenamiento)
pred.lda <- predict(ldapre, Weekly.20092010)
matriz<-confusionMatrix(pred.lda$class, verdaderos)
matriz$table
## Reference
## Prediction Down Up
## Down 33 47
## Up 10 14
## [1] "Accuracy: 0.4519 Sensitivity: 0.7674 Specificity: 0.2295"
# For Lag2 + Lag3
ldapre <- lda(Direction ~ Lag2 + Lag3, data = Weekly, subset = w_entrenamiento)
pred.lda <- predict(ldapre, Weekly.20092010)
matriz<-confusionMatrix(pred.lda$class, verdaderos)
matriz$table
## Reference
## Prediction Down Up
## Down 8 4
## Up 35 57
## [1] "Accuracy: 0.625 Sensitivity: 0.186 Specificity: 0.9344"
# For Lag1 + Lag2
ldapre <- lda(Direction ~ Lag1 + Lag2, data = Weekly, subset = w_entrenamiento)
pred.lda <- predict(ldapre, Weekly.20092010)
matriz<-confusionMatrix(pred.lda$class, verdaderos)
matriz$table
## Reference
## Prediction Down Up
## Down 7 8
## Up 36 53
## [1] "Accuracy: 0.5769 Sensitivity: 0.1628 Specificity: 0.8689"
# For Lag1 + Lag2 + Volume
ldapre <- lda(Direction ~ Lag1 + Lag2 + Volume, data = Weekly, subset = w_entrenamiento)
pred.lda <- predict(ldapre, Weekly.20092010)
matriz<-confusionMatrix(pred.lda$class, verdaderos)
matriz$table
## Reference
## Prediction Down Up
## Down 27 33
## Up 16 28
## [1] "Accuracy: 0.5288 Sensitivity: 0.6279 Specificity: 0.459"
Now we try QDA with the same sets of predictors: Volume; Lag2 + Lag3; Lag1 + Lag2; Lag1 + Lag2 + Volume.
# For Volume
qdapre <- qda(Direction ~ Volume, data = Weekly, subset = w_entrenamiento)
pred.qda <- predict(qdapre, Weekly.20092010)
matriz<-confusionMatrix(pred.qda$class, verdaderos)
matriz$table
## Reference
## Prediction Down Up
## Down 43 59
## Up 0 2
## [1] "Accuracy: 0.4327 Sensitivity: 1 Specificity: 0.0328"
# For Lag2 + Lag3
qdapre <- qda(Direction ~ Lag2 + Lag3, data = Weekly, subset = w_entrenamiento)
pred.qda <- predict(qdapre, Weekly.20092010)
matriz<-confusionMatrix(pred.qda$class, verdaderos)
matriz$table
## Reference
## Prediction Down Up
## Down 4 2
## Up 39 59
## [1] "Accuracy: 0.6058 Sensitivity: 0.093 Specificity: 0.9672"
# For Lag1 + Lag2
qdapre <- qda(Direction ~ Lag1 + Lag2, data = Weekly, subset = w_entrenamiento)
pred.qda <- predict(qdapre, Weekly.20092010)
matriz<-confusionMatrix(pred.qda$class, verdaderos)
matriz$table
## Reference
## Prediction Down Up
## Down 7 10
## Up 36 51
## [1] "Accuracy: 0.5577 Sensitivity: 0.1628 Specificity: 0.8361"
# For Lag1 + Lag2 + Volume
qdapre <- qda(Direction ~ Lag1 + Lag2 + Volume, data = Weekly, subset = w_entrenamiento)
pred.qda <- predict(qdapre, Weekly.20092010)
matriz<-confusionMatrix(pred.qda$class, verdaderos)
matriz$table
## Reference
## Prediction Down Up
## Down 31 44
## Up 12 17
## [1] "Accuracy: 0.4615 Sensitivity: 0.7209 Specificity: 0.2787"
Auto <- Auto[c(-10,-11,-12,-13,-14,-15,-16,-17,-18)]
mpg01 <- rep(0, length(Auto$mpg))
mpg01[Auto$mpg > median(Auto$mpg)] <- 1
Auto <- data.frame(Auto, mpg01)
str(Auto)
## 'data.frame': 392 obs. of 10 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## $ mpg01 : num 0 0 0 0 0 0 0 0 0 0 ...
# Since name is a factor, we exclude it from the analysis
pairs.panels(Auto[,-9],
method = "pearson",
hist.col = "#00AFBB",
density = TRUE
)

We can see that mpg01 is negatively correlated with cylinders, displacement, horsepower, and weight.
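The `train` and `test` data frames used by the fits below are not constructed in this excerpt. The GLM's 293 residual degrees of freedom imply 294 training rows, which matches a 75/25 split; a plausible sketch (seed assumed, model formulas taken from the Call outputs below) is:

```r
# Assumption: simple random 75/25 train/test split of Auto
library(MASS)
set.seed(1)
idx <- sample(1:nrow(Auto), 0.75 * nrow(Auto))
train <- Auto[idx, ]
test <- Auto[-idx, ]
# The fits used below were presumably created as:
autos.lda <- lda(mpg01 ~ displacement + weight + cylinders + horsepower, data = train)
autos.qda <- qda(mpg01 ~ cylinders + weight + displacement + horsepower, data = train)
```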
## Call:
## lda(mpg01 ~ displacement + weight + cylinders + horsepower, data = train)
##
## Prior probabilities of groups:
## 0 1
## 0.4965986 0.5034014
##
## Group means:
## displacement weight cylinders horsepower
## 0 269.9932 3612.719 6.753425 128.78767
## 1 114.3243 2330.216 4.128378 77.84459
##
## Coefficients of linear discriminants:
## LD1
## displacement -0.0002217597
## weight -0.0008836637
## cylinders -0.5336426773
## horsepower 0.0014962787
pred.lda <- predict(autos.lda, test, type="response")
matriz <- confusionMatrix(pred.lda$class, as.factor(test$mpg01))
matriz$table
## Reference
## Prediction 0 1
## 0 45 6
## 1 5 42
## [1] "Accuracy: 0.8878 Sensitivity: 0.9 Specificity: 0.875"
We can see that the test error rate is 11.22% (1 − 0.8878).
## Call:
## qda(mpg01 ~ cylinders + weight + displacement + horsepower, data = train)
##
## Prior probabilities of groups:
## 0 1
## 0.4965986 0.5034014
##
## Group means:
## cylinders weight displacement horsepower
## 0 6.753425 3612.719 269.9932 128.78767
## 1 4.128378 2330.216 114.3243 77.84459
pred.qda <- predict(autos.qda, test)
matriz <- confusionMatrix(pred.qda$class, as.factor(test$mpg01))
matriz$table
## Reference
## Prediction 0 1
## 0 46 6
## 1 4 42
## [1] "Accuracy: 0.898 Sensitivity: 0.92 Specificity: 0.875"
As we can see, the test error rate is 10.2% (1 − 0.898).
autos.glm <- glm(mpg01 ~ cylinders + weight + displacement + horsepower, data = train, family = binomial)
summary(autos.glm)
##
## Call:
## glm(formula = mpg01 ~ cylinders + weight + displacement + horsepower,
## family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4090 -0.1003 0.1220 0.3751 3.4116
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 12.3555043 2.0347410 6.072 1.26e-09 ***
## cylinders -0.3974473 0.4141795 -0.960 0.33726
## weight -0.0016851 0.0007836 -2.150 0.03152 *
## displacement -0.0072747 0.0094229 -0.772 0.44010
## horsepower -0.0469950 0.0161218 -2.915 0.00356 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 407.56 on 293 degrees of freedom
## Residual deviance: 154.16 on 289 degrees of freedom
## AIC: 164.16
##
## Number of Fisher Scoring iterations: 7
p <- predict(autos.glm, test, type = "response")
predichos=as.factor(ifelse(test = p > 0.5, yes = 1, no = 0))
verdaderos=as.factor(test$mpg01)
matriz<-confusionMatrix(predichos, verdaderos)
matriz$table
## Reference
## Prediction 0 1
## 0 46 7
## 1 4 41
## [1] "Accuracy: 0.8878 Sensitivity: 0.92 Specificity: 0.8542"
The test error rate is 11.22%.
train.X2 = cbind(train$displacement, train$weight, train$cylinders, train$year)
test.X2 = cbind(test$displacement, test$weight, test$cylinders, test$year)
train.Y2 = cbind(train$mpg01)

# For K = 10
pred.knn <- knn(train.X2, test.X2, train.Y2, k = 10)
z <-confusionMatrix(pred.knn, as.factor(test$mpg01))
z$table
## Reference
## Prediction 0 1
## 0 42 3
## 1 8 45
## [1] "Accuracy: 0.8878 Sensitivity: 0.84 Specificity: 0.9375"
The test error rate is 11.22%.
# For K = 50
pred.knn <- knn(train.X2, test.X2, train.Y2, k = 50)
z <-confusionMatrix(pred.knn, as.factor(test$mpg01))
z$table
## Reference
## Prediction 0 1
## 0 42 4
## 1 8 44
## [1] "Accuracy: 0.8776 Sensitivity: 0.84 Specificity: 0.9167"
The test error rate is 12.24%.
# For K = 100
pred.knn <- knn(train.X2, test.X2, train.Y2, k = 100)
z <-confusionMatrix(pred.knn, as.factor(test$mpg01))
z$table
## Reference
## Prediction 0 1
## 0 43 4
## 1 7 44
## [1] "Accuracy: 0.8878 Sensitivity: 0.86 Specificity: 0.9167"
The test error rate is 11.22%.
We can see that from K = 10 onward the error rate stays roughly constant.
## [1] 8
## [1] 6561
## [1] 1000
## [1] 2.2518e+15
## [1] 2248091
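The numeric outputs above come from the `Power` functions of this exercise, whose definitions are not included in this excerpt; a standard reconstruction is:

```r
# Reconstruction (assumed): Power2 prints x^a, Power3 returns it
# so the result can be stored or plotted
Power2 <- function(x, a) {
  print(x^a)
}
Power3 <- function(x, a) {
  result <- x^a
  return(result)
}
```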
x <- 1:10
plot(x, Power3(x, 2), log = "xy", xlab = "Log of x", ylab = "Log of x^2", main = "Log of x^2 vs Log of x")

We are asked to use the Boston data set; its variables are shown below:
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Now we compute the median of the variable crim and create a new binary variable: 0 if crim <= median, 1 if crim > median.
mediana_crim <- rep(0, length(Boston$crim))
mediana_crim[Boston$crim > median(Boston$crim)] <- 1
Boston$mediana_crim = mediana_crim
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv mediana_crim
## 1 4.98 24.0 0
## 2 9.14 21.6 0
## 3 4.03 34.7 0
## 4 2.94 33.4 0
## 5 5.33 36.2 0
## 6 5.21 28.7 0
Now we build our training and test sets.
set.seed(123)
train_index <- sample(1:nrow(Boston), 0.8 * nrow(Boston))
test_index <- setdiff(1:nrow(Boston), train_index)

To understand the variables better, we draw a scatterplot and correlation matrix.
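The model code below uses `train` and `test` data frames; presumably they were built from these indices:

```r
# Assumption: data frames derived from the sampled indices
train <- Boston[train_index, ]
test <- Boston[test_index, ]
```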
With respect to mediana_crim, the variables nox, rad, dis, age, tax, and indus are the most correlated. For this exercise we fit models starting from the single most correlated variable and add variables one at a time to find the best model for each technique.
Starting with the logistic regression models:
glm_13.fit <- glm(mediana_crim ~ nox, data = train, family = binomial)
probs <- predict(glm_13.fit, test[,-15], type = "response")
predichos=as.factor(ifelse(test = probs > 0.5, yes = 1, no = 0))
verdaderos=as.factor(test[,15])
matriz<-confusionMatrix(predichos, verdaderos)
matriz$table
## Reference
## Prediction 0 1
## 0 40 5
## 1 9 48
## [1] "Accuracy: 0.8627 Sensitivity: 0.8163 Specificity: 0.9057"
glm_13.fit <- glm(mediana_crim ~ nox + rad, data = train, family = binomial)
probs <- predict(glm_13.fit, test[,-15], type = "response")
predichos=as.factor(ifelse(test = probs > 0.5, yes = 1, no = 0))
matriz<-confusionMatrix(predichos, verdaderos)
matriz$table
## Reference
## Prediction 0 1
## 0 46 7
## 1 3 46
## [1] "Accuracy: 0.902 Sensitivity: 0.9388 Specificity: 0.8679"
glm_13.fit <- glm(mediana_crim ~ nox + rad + dis, data = train, family = binomial)
probs <- predict(glm_13.fit, test[,-15], type = "response")
predichos=as.factor(ifelse(test = probs > 0.5, yes = 1, no = 0))
matriz<-confusionMatrix(predichos, verdaderos)
matriz$table
## Reference
## Prediction 0 1
## 0 46 7
## 1 3 46
## [1] "Accuracy: 0.902 Sensitivity: 0.9388 Specificity: 0.8679"
As the models above show, nox and rad alone gave the best model; adding more predictors did not improve the results: Accuracy: 0.902, Sensitivity: 0.9388, Specificity: 0.8679.
Now we move on to the LDA models.
ldapre <- lda(mediana_crim ~ nox, data = train)
pred.lda <- predict(ldapre, test[,-15])
matriz<-confusionMatrix(pred.lda$class, verdaderos)
matriz$table
## Reference
## Prediction 0 1
## 0 42 11
## 1 7 42
## [1] "Accuracy: 0.8235 Sensitivity: 0.8571 Specificity: 0.7925"
ldapre <- lda(mediana_crim ~ nox + rad, data = train)
pred.lda <- predict(ldapre, test[,-15])
matriz<-confusionMatrix(pred.lda$class, verdaderos)
matriz$table
## Reference
## Prediction 0 1
## 0 47 12
## 1 2 41
## [1] "Accuracy: 0.8627 Sensitivity: 0.9592 Specificity: 0.7736"
ldapre <- lda(mediana_crim ~ nox + rad + dis, data = train)
pred.lda <- predict(ldapre, test[,-15])
matriz<-confusionMatrix(pred.lda$class, verdaderos)
matriz$table
## Reference
## Prediction 0 1
## 0 45 11
## 1 4 42
## [1] "Accuracy: 0.8529 Sensitivity: 0.9184 Specificity: 0.7925"
ldapre <- lda(mediana_crim ~ nox + rad + dis + age, data = train)
pred.lda <- predict(ldapre, test[,-15])
matriz<-confusionMatrix(pred.lda$class, verdaderos)
matriz$table
## Reference
## Prediction 0 1
## 0 45 13
## 1 4 40
## [1] "Accuracy: 0.8333 Sensitivity: 0.9184 Specificity: 0.7547"
Adding more variables does not improve the accuracy of the model; of those tried, the best used nox and rad: Accuracy: 0.8627, Sensitivity: 0.9592, Specificity: 0.7736.
KNN models
modelos_knn <- function(x_train, x_test, y_train, y_test, nk){
  pred.knn <- knn(x_train, x_test, y_train, k = nk)
  z <- confusionMatrix(pred.knn, y_test)
  print(paste0("K = ", nk))
  imprimirASS(z)
}

entrenamiento.x <- as.matrix(train[,-15])
prueba.x <- as.matrix(test[,-15])
entrenamiento.y <- as.factor(train[,15])
prueba.y <- as.factor(test[,15])
## [1] "K = 2"
## [1] "Accuracy: 0.9118 Sensitivity: 0.9184 Specificity: 0.9057"
## [1] "K = 3"
## [1] "Accuracy: 0.9216 Sensitivity: 0.9184 Specificity: 0.9245"
## [1] "K = 4"
## [1] "Accuracy: 0.9314 Sensitivity: 0.9388 Specificity: 0.9245"
## [1] "K = 5"
## [1] "Accuracy: 0.951 Sensitivity: 0.9796 Specificity: 0.9245"
## [1] "K = 6"
## [1] "Accuracy: 0.9314 Sensitivity: 0.9592 Specificity: 0.9057"
## [1] "K = 7"
## [1] "Accuracy: 0.9314 Sensitivity: 0.9592 Specificity: 0.9057"
## [1] "K = 8"
## [1] "Accuracy: 0.9314 Sensitivity: 0.9592 Specificity: 0.9057"
## [1] "K = 9"
## [1] "Accuracy: 0.9314 Sensitivity: 0.9592 Specificity: 0.9057"
## [1] "K = 10"
## [1] "Accuracy: 0.9314 Sensitivity: 0.9592 Specificity: 0.9057"
## [1] "K = 11"
## [1] "Accuracy: 0.9314 Sensitivity: 0.9592 Specificity: 0.9057"
## [1] "K = 12"
## [1] "Accuracy: 0.9216 Sensitivity: 0.9592 Specificity: 0.8868"
## [1] "K = 13"
## [1] "Accuracy: 0.9118 Sensitivity: 0.9388 Specificity: 0.8868"
## [1] "K = 14"
## [1] "Accuracy: 0.9216 Sensitivity: 0.9388 Specificity: 0.9057"
## [1] "K = 15"
## [1] "Accuracy: 0.902 Sensitivity: 0.9388 Specificity: 0.8679"
## [1] "K = 16"
## [1] "Accuracy: 0.902 Sensitivity: 0.9388 Specificity: 0.8679"
## [1] "K = 17"
## [1] "Accuracy: 0.902 Sensitivity: 0.9388 Specificity: 0.8679"
## [1] "K = 18"
## [1] "Accuracy: 0.8922 Sensitivity: 0.9184 Specificity: 0.8679"
## [1] "K = 19"
## [1] "Accuracy: 0.8725 Sensitivity: 0.9184 Specificity: 0.8302"
## [1] "K = 20"
## [1] "Accuracy: 0.8824 Sensitivity: 0.9184 Specificity: 0.8491"
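The K = 2 through K = 20 results above were presumably produced by a loop such as:

```r
# Assumption: evaluate modelos_knn for K from 2 to 20
for (k in 2:20) {
  modelos_knn(entrenamiento.x, prueba.x, entrenamiento.y, prueba.y, k)
}
```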
K = 5 gave the highest accuracy, which also makes it the best of all the models tried in this exercise.
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:psych':
##
## outlier
set.seed(1)
train <- sample(1:nrow(Boston), 400)
Boston.train <- Boston[train, -14]
Boston.test <- Boston[-train, -14]
Y.train <- Boston[train, 14]
Y.test <- Boston[-train, 14]
boston1 <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test, mtry = ncol(Boston) - 1, ntree = 500)
boston2 <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test, mtry = (ncol(Boston) - 1) / 2, ntree = 500)
boston3 <- randomForest(Boston.train, y = Y.train, xtest = Boston.test, ytest = Y.test, mtry = sqrt(ncol(Boston) - 1), ntree = 500)
valores <- data.frame(c(1:500))
names(valores) <- "arboles"
valores["MSE1"] <- boston1$test$mse
valores["MSE2"] <- boston2$test$mse
valores["MSE3"] <- boston3$test$mse
library(ggplot2)
library(reshape2)
dd = melt(valores, id=c("arboles"))
theme_set(theme_bw())
ggplot(dd) + geom_line(aes(x=arboles, y=value, color=variable)) +
scale_color_manual(name = "m = ",
labels = c("p","p/2","\u221Ap"),
values=c("green","red","blue")) +
labs(x = "Número de árboles",
y = "Error de clasificación") +
theme(legend.position="top")

As the plot shows, with a single tree the error is very high, but it stabilizes as the number of trees grows, at roughly 100 trees onward, and this behavior holds for all three values of m. As for which m gives the lowest error, in my case it was \(m=p\), although this result may change with a different seed.
train <- sample(1:nrow(Carseats), 300)
Carseats.train <- Carseats[train, ]
Carseats.test <- Carseats[-train, ]

library(rpart)
library(rpart.plot)
tree <- rpart(Sales~., data=Carseats.train)
rpart.plot(tree, box.palette="RdBu", nn=TRUE)
## [1] 4.197228
The test MSE is 4.197228.
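The test MSE above was presumably computed as:

```r
# Assumption: test MSE of the unpruned regression tree
yhat <- predict(tree, newdata = Carseats.test)
mean((yhat - Carseats.test$Sales)^2)
```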
##
## Regression tree:
## rpart(formula = Sales ~ ., data = Carseats.train)
##
## Variables actually used in tree construction:
## [1] Advertising Age CompPrice Income Population Price
## [7] ShelveLoc
##
## Root node error: 2247.1/300 = 7.4904
##
## n= 300
##
## CP nsplit rel error xerror xstd
## 1 0.211723 0 1.00000 1.01804 0.082943
## 2 0.123518 1 0.78828 0.80759 0.065671
## 3 0.057792 2 0.66476 0.71205 0.057322
## 4 0.039590 3 0.60697 0.68368 0.053027
## 5 0.032717 4 0.56738 0.67813 0.053775
## 6 0.026549 5 0.53466 0.66341 0.050763
## 7 0.021459 6 0.50811 0.62629 0.048310
## 8 0.018375 7 0.48665 0.63310 0.050340
## 9 0.017637 9 0.44990 0.64361 0.051830
## 10 0.017500 10 0.43226 0.64903 0.051778
## 11 0.015287 11 0.41476 0.62861 0.050278
## 12 0.014037 13 0.38419 0.62867 0.050364
## 13 0.013272 14 0.37015 0.64636 0.051480
## 14 0.012158 15 0.35688 0.64895 0.051655
## 15 0.011647 16 0.34472 0.64894 0.051919
## 16 0.010000 17 0.33308 0.64854 0.051978
## 7
## 7
## [1] 0.02145892
Cross-validation selects a tree size of 7 and a cp of 0.02145892.
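The pruned-tree MSE reported next was presumably obtained by pruning at the cross-validated cp and recomputing the test MSE:

```r
# Assumption: prune at the cp chosen by cross-validation
pruned <- rpart::prune(tree, cp = 0.02145892)
yhat.pruned <- predict(pruned, newdata = Carseats.test)
mean((yhat.pruned - Carseats.test$Sales)^2)
```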
## [1] 4.54531
In our case the test MSE increases after pruning.
bag.carseats <- randomForest(Sales ~ ., data = Carseats.train, mtry = (ncol(Carseats) - 1), importance = TRUE)
yhat.bag <- predict(bag.carseats, newdata = Carseats.test)
mean((yhat.bag - Carseats.test$Sales)^2)
## [1] 2.381242
Bagging improves the test MSE to 2.38.
## %IncMSE IncNodePurity
## CompPrice 38.928557 272.987960
## Income 11.725449 122.545479
## Advertising 14.937292 128.773274
## Population 2.518008 95.322579
## Price 71.644649 716.074862
## ShelveLoc 69.101715 596.025736
## Age 17.537409 167.453663
## Education 2.793162 64.343849
## Urban -2.073262 8.985155
## US 4.206136 9.651551
This shows that Price, ShelveLoc, and CompPrice are the most important predictors in this analysis.
bag.carseats <- randomForest(Sales ~ ., data = Carseats.train, mtry = sqrt((ncol(Carseats) - 1)), importance = TRUE)
yhat.bag <- predict(bag.carseats, newdata = Carseats.test)
mean((yhat.bag - Carseats.test$Sales)^2)
## [1] 2.80204
In this case the test MSE is 2.8.
## %IncMSE IncNodePurity
## CompPrice 16.2755940 216.47915
## Income 7.8217396 184.80927
## Advertising 14.1807157 183.03683
## Population 0.6892522 155.21205
## Price 42.0256227 551.81898
## ShelveLoc 47.8554474 473.99364
## Age 12.1305637 213.32128
## Education 2.8662179 91.35868
## Urban -2.3838218 18.20191
## US 3.4288791 26.69131
We can also see that the most important variables remain the same as in the previous point.
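The classification-tree summary below (OJ data) is shown without its fitting code; presumably it was produced along these lines (the 800-row split matches the summary, the seed is assumed):

```r
# Assumption: 800-row training split of the OJ data and a
# classification tree fit with the tree package
library(tree)
train <- sample(1:nrow(OJ), 800)
OJ.train <- OJ[train, ]
OJ.test <- OJ[-train, ]
tree.oj <- tree(Purchase ~ ., data = OJ.train)
summary(tree.oj)
```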
##
## Classification tree:
## tree(formula = Purchase ~ ., data = OJ.train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "SalePriceMM" "SpecialCH" "PriceDiff"
## Number of terminal nodes: 9
## Residual mean deviance: 0.707 = 559.3 / 791
## Misclassification error rate: 0.1512 = 121 / 800
The tree has 9 terminal nodes and a training error rate of 0.1512. The variables used in its construction are LoyalCH, SalePriceMM, SpecialCH, and PriceDiff.
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1065.00 CH ( 0.61625 0.38375 )
## 2) LoyalCH < 0.5036 351 424.90 MM ( 0.29345 0.70655 )
## 4) LoyalCH < 0.264232 151 102.10 MM ( 0.10596 0.89404 )
## 8) LoyalCH < 0.051325 57 0.00 MM ( 0.00000 1.00000 ) *
## 9) LoyalCH > 0.051325 94 85.77 MM ( 0.17021 0.82979 ) *
## 5) LoyalCH > 0.264232 200 273.90 MM ( 0.43500 0.56500 )
## 10) SalePriceMM < 2.04 109 126.30 MM ( 0.26606 0.73394 )
## 20) SpecialCH < 0.5 83 78.43 MM ( 0.18072 0.81928 ) *
## 21) SpecialCH > 0.5 26 35.89 CH ( 0.53846 0.46154 ) *
## 11) SalePriceMM > 2.04 91 119.20 CH ( 0.63736 0.36264 ) *
## 3) LoyalCH > 0.5036 449 349.40 CH ( 0.86860 0.13140 )
## 6) LoyalCH < 0.764572 188 217.80 CH ( 0.73404 0.26596 )
## 12) PriceDiff < 0.265 113 153.40 CH ( 0.58407 0.41593 )
## 24) PriceDiff < -0.165 34 41.19 MM ( 0.29412 0.70588 ) *
## 25) PriceDiff > -0.165 79 95.30 CH ( 0.70886 0.29114 ) *
## 13) PriceDiff > 0.265 75 25.19 CH ( 0.96000 0.04000 ) *
## 7) LoyalCH > 0.764572 261 78.30 CH ( 0.96552 0.03448 ) *
We pick terminal node 7. Its splitting criterion is LoyalCH > 0.764572, and it contains 261 observations with a deviance of 78.30. The prediction in this node is Purchase = CH: 96.552% of its observations take the value CH, while 3.448% take the value MM.
Here the main variable is LoyalCH; the nodes growing from the root also split on LoyalCH. At the root, when LoyalCH > 0.5036 the only other variable used further down is PriceDiff, whereas in the opposite branch the variables SalePriceMM and SpecialCH are used.
tree.pred <- predict(tree.oj, OJ.test, type = "class")
matriz<-confusionMatrix(tree.pred, OJ.test$Purchase)
matriz## Confusion Matrix and Statistics
##
## Reference
## Prediction CH MM
## CH 144 34
## MM 16 76
##
## Accuracy : 0.8148
## 95% CI : (0.7633, 0.8593)
## No Information Rate : 0.5926
## P-Value [Acc > NIR] : 4.577e-15
##
## Kappa : 0.6064
##
## Mcnemar's Test P-Value : 0.01621
##
## Sensitivity : 0.9000
## Specificity : 0.6909
## Pos Pred Value : 0.8090
## Neg Pred Value : 0.8261
## Prevalence : 0.5926
## Detection Rate : 0.5333
## Detection Prevalence : 0.6593
## Balanced Accuracy : 0.7955
##
## 'Positive' Class : CH
##
The model's accuracy is 81.48%; it is much better at predicting CH (sensitivity of 90%) than MM (specificity of only 69.09%).
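The headline numbers can be recomputed directly from the 2×2 table; a base-R sketch using the counts from the confusion matrix above:

```r
# Confusion matrix (rows = predicted, cols = reference), counts from above,
# with "CH" treated as the positive class
cm <- matrix(c(144, 16, 34, 76), nrow = 2,
             dimnames = list(Prediction = c("CH", "MM"),
                             Reference  = c("CH", "MM")))
accuracy    <- sum(diag(cm)) / sum(cm)         # (144+76)/270
sensitivity <- cm["CH", "CH"] / sum(cm[, "CH"]) # 144/160
specificity <- cm["MM", "MM"] / sum(cm[, "MM"]) # 76/110
round(c(accuracy, sensitivity, specificity), 4)  # 0.8148 0.9000 0.6909
```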
## $size
## [1] 9 8 7 4 2 1
##
## $dev
## [1] 135 135 140 151 172 303
##
## $k
## [1] -Inf 0.000000 2.000000 4.666667 12.500000 145.000000
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
valores <- data.frame(cv.oj$size, cv.oj$dev)
names(valores) <- c("Size", "Dev")
library(ggplot2)
library(reshape2)
theme_set(theme_bw())
ggplot(data = valores, aes(x=Size, y=Dev)) +
geom_line(color="red", linetype = "dashed") +
geom_point() +
scale_x_continuous(breaks = 1:9)
According to the cross-validation output, the lowest error (135 misclassifications) occurs at tree sizes 8 and 9; a 7-node tree is nearly as good (140) while being simpler, and it is the one we prune to below.
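Selecting the tree size from the `cv.oj` object can also be done programmatically. A base-R sketch using the values printed above, preferring the smallest tree among those tying for the minimum:

```r
# Values copied from the cv.oj output above
size <- c(9, 8, 7, 4, 2, 1)
dev  <- c(135, 135, 140, 151, 172, 303)

# Smallest tree size achieving the minimum CV error
best <- min(size[dev == min(dev)])
best  # 8
```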
##
## Classification tree:
## snip.tree(tree = tree.oj, nodes = c(4L, 10L))
## Variables actually used in tree construction:
## [1] "LoyalCH" "SalePriceMM" "PriceDiff"
## Number of terminal nodes: 7
## Residual mean deviance: 0.7409 = 587.5 / 793
## Misclassification error rate: 0.1538 = 123 / 800
##
## Classification tree:
## tree(formula = Purchase ~ ., data = OJ.train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "SalePriceMM" "SpecialCH" "PriceDiff"
## Number of terminal nodes: 9
## Residual mean deviance: 0.707 = 559.3 / 791
## Misclassification error rate: 0.1512 = 121 / 800
We see that there is practically no difference between the two trees; the error rate is slightly higher in the pruned tree (0.1538 vs. 0.1512).
prune.pred <- predict(prune.oj, OJ.test, type = "class")
matriz<-confusionMatrix(prune.pred, OJ.test$Purchase)
matriz## Confusion Matrix and Statistics
##
## Reference
## Prediction CH MM
## CH 141 34
## MM 19 76
##
## Accuracy : 0.8037
## 95% CI : (0.7512, 0.8494)
## No Information Rate : 0.5926
## P-Value [Acc > NIR] : 1.152e-13
##
## Kappa : 0.5846
##
## Mcnemar's Test P-Value : 0.05447
##
## Sensitivity : 0.8812
## Specificity : 0.6909
## Pos Pred Value : 0.8057
## Neg Pred Value : 0.8000
## Prevalence : 0.5926
## Detection Rate : 0.5222
## Detection Prevalence : 0.6481
## Balanced Accuracy : 0.7861
##
## 'Positive' Class : CH
##
We see that the model's test accuracy dropped slightly relative to the unpruned tree (80.37% vs. 81.48%).
## Loaded gbm 2.1.5
lambdas = seq(0.001, 0.3, 0.005)
mse <- rep(0, length(lambdas))
for (i in 1:length(lambdas)) {
boost.hitters <- gbm(Salary ~ ., data = Hitters.train, distribution = "gaussian", n.trees = 1000, shrinkage = lambdas[i])
pred.train <- predict(boost.hitters, Hitters.train, n.trees = 1000)
mse[i] <- mean((pred.train - Hitters.train$Salary)^2)
}
library(ggplot2)
mse_graph <- data.frame(lambdas)
mse_graph["MSE"]<- mse
names(mse_graph) <- c("Lambdas", "MSE")
theme_set(theme_bw())
ggplot(data = mse_graph, aes(x = Lambdas, y = MSE)) +
geom_line(color="red", linetype = "dashed") +
geom_point()
mse2 <- rep(NA, length(lambdas))
for (i in 1:length(lambdas)) {
boost.hitters <- gbm(Salary ~ ., data = Hitters.train, distribution = "gaussian", n.trees = 1000, shrinkage = lambdas[i])
yhat <- predict(boost.hitters, Hitters.test, n.trees = 1000)
mse2[i] <- mean((yhat - Hitters.test$Salary)^2)
}
library(ggplot2)
mse_graph2 <- data.frame(lambdas)
mse_graph2["MSE"]<- mse2
names(mse_graph2) <- c("Lambdas", "MSE")
theme_set(theme_bw())
ggplot(data = mse_graph2, aes(x = Lambdas, y = MSE)) +
geom_line(color="red", linetype = "dashed") +
geom_point() ## [1] 0.2475835
## [1] 0.166
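The two values printed above appear to be `min(mse2)` (the lowest test MSE, 0.2476) and `lambdas[which.min(mse2)]` (the corresponding shrinkage, 0.166); as a sanity check, 0.166 does lie on the lambda grid used:

```r
lambdas <- seq(0.001, 0.3, 0.005)  # same grid as above

# 0.166 = 0.001 + 33 * 0.005 lies on the grid; compare with a
# tolerance to avoid floating-point surprises
any(abs(lambdas - 0.166) < 1e-9)  # TRUE
```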
lm810 <- lm(Salary ~ ., data = Hitters.train)
pred810 <- predict(lm810, Hitters.test)
mean((pred810 - Hitters.test$Salary)^2)## [1] 0.4917959
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
pcr.fit=pcr(Salary~., data=Hitters.train , scale=TRUE , validation ="CV")
validationplot(pcr.fit ,val.type="MSEP")## [1] 0.4661183
We can see that the boosting test MSE is lower than those obtained with linear regression and principal components regression.
boost.hitters <- gbm(Salary ~ ., data = Hitters.train, distribution = "gaussian", n.trees = 1000, shrinkage = lambdas[which.min(mse2)])
summary(boost.hitters)## var rel.inf
## CAtBat CAtBat 17.7715143
## CWalks CWalks 9.0983419
## CRuns CRuns 8.1091446
## PutOuts PutOuts 7.8019290
## Walks Walks 7.3620230
## CHmRun CHmRun 6.4505765
## Assists Assists 5.6779659
## Hits Hits 5.0562630
## RBI RBI 4.9022635
## Years Years 4.8919157
## AtBat AtBat 4.3145631
## CHits CHits 4.0969282
## CRBI CRBI 3.8001772
## Runs Runs 3.6363269
## HmRun HmRun 3.3847743
## Errors Errors 2.3407756
## NewLeague NewLeague 0.5117111
## Division Division 0.4565112
## League League 0.3362948
The variable CAtBat is the most important.
bag.hitters <- randomForest(Salary ~ ., data = Hitters.train, mtry = 19)
yhat.bag <- predict(bag.hitters, newdata = Hitters.test)
mean((yhat.bag - Hitters.test$Salary)^2)## [1] 0.22971
The test MSE is 0.2297, which makes bagging slightly better than the methods tried so far.
train <- 1:1000
Caravan$Purchase <- ifelse(Caravan$Purchase == "Yes", 1, 0)
Caravan.train <- Caravan[train, ]
Caravan.test <- Caravan[-train, ]
# note: distribution = "bernoulli" would be the natural choice for a 0/1 response
bc <- gbm(Purchase ~ ., data = Caravan.train, distribution = "gaussian", n.trees = 1000, shrinkage = 0.01)## Warning in gbm.fit(x = x, y = y, offset = offset, distribution =
## distribution, : variable 50: PVRAAUT has no variation.
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution =
## distribution, : variable 71: AVRAAUT has no variation.
## var rel.inf
## PPERSAUT PPERSAUT 14.6627387
## MKOOPKLA MKOOPKLA 11.5135457
## MOPLHOOG MOPLHOOG 8.1166824
## MBERMIDD MBERMIDD 5.5659055
## PBRAND PBRAND 4.8848882
## MGODGE MGODGE 4.3094073
## ABRAND ABRAND 4.0899213
## MINK3045 MINK3045 3.7799757
## MAUT1 MAUT1 3.0595852
## MOSTYPE MOSTYPE 2.6279165
## PWAPART PWAPART 2.6176353
## MBERARBG MBERARBG 2.1522326
## MGODPR MGODPR 2.1053373
## MAUT2 MAUT2 2.0837180
## MSKC MSKC 2.0548149
## MSKA MSKA 1.8052960
## MSKB1 MSKB1 1.7245921
## PBYSTAND PBYSTAND 1.5426299
## MBERHOOG MBERHOOG 1.4526150
## MFWEKIND MFWEKIND 1.4470476
## MINK7512 MINK7512 1.3601434
## MGODRK MGODRK 1.3247916
## MGODOV MGODOV 1.2359721
## MOPLMIDD MOPLMIDD 1.1189140
## MINK4575 MINK4575 1.0021078
## MRELGE MRELGE 0.9752463
## MRELOV MRELOV 0.9168410
## MAUT0 MAUT0 0.8624586
## MINKGEM MINKGEM 0.8250010
## MFGEKIND MFGEKIND 0.7750144
## APERSAUT APERSAUT 0.7452557
## MBERBOER MBERBOER 0.6269871
## MINKM30 MINKM30 0.6230703
## MOSHOOFD MOSHOOFD 0.6128366
## MZFONDS MZFONDS 0.5781958
## PMOTSCO PMOTSCO 0.5172908
## MHHUUR MHHUUR 0.4691845
## MHKOOP MHKOOP 0.4467587
## MZPART MZPART 0.4351830
## MBERARBO MBERARBO 0.4270172
## MGEMLEEF MGEMLEEF 0.3961990
## MGEMOMV MGEMOMV 0.3886987
## MSKD MSKD 0.3409863
## PLEVEN PLEVEN 0.3215526
## MSKB2 MSKB2 0.2531789
## MINK123M MINK123M 0.2416957
## MOPLLAAG MOPLLAAG 0.1897649
## MRELSA MRELSA 0.1837443
## MBERZELF MBERZELF 0.1086964
## MFALLEEN MFALLEEN 0.1007280
## MAANTHUI MAANTHUI 0.0000000
## PWABEDR PWABEDR 0.0000000
## PWALAND PWALAND 0.0000000
## PBESAUT PBESAUT 0.0000000
## PVRAAUT PVRAAUT 0.0000000
## PAANHANG PAANHANG 0.0000000
## PTRACTOR PTRACTOR 0.0000000
## PWERKT PWERKT 0.0000000
## PBROM PBROM 0.0000000
## PPERSONG PPERSONG 0.0000000
## PGEZONG PGEZONG 0.0000000
## PWAOREG PWAOREG 0.0000000
## PZEILPL PZEILPL 0.0000000
## PPLEZIER PPLEZIER 0.0000000
## PFIETS PFIETS 0.0000000
## PINBOED PINBOED 0.0000000
## AWAPART AWAPART 0.0000000
## AWABEDR AWABEDR 0.0000000
## AWALAND AWALAND 0.0000000
## ABESAUT ABESAUT 0.0000000
## AMOTSCO AMOTSCO 0.0000000
## AVRAAUT AVRAAUT 0.0000000
## AAANHANG AAANHANG 0.0000000
## ATRACTOR ATRACTOR 0.0000000
## AWERKT AWERKT 0.0000000
## ABROM ABROM 0.0000000
## ALEVEN ALEVEN 0.0000000
## APERSONG APERSONG 0.0000000
## AGEZONG AGEZONG 0.0000000
## AWAOREG AWAOREG 0.0000000
## AZEILPL AZEILPL 0.0000000
## APLEZIER APLEZIER 0.0000000
## AFIETS AFIETS 0.0000000
## AINBOED AINBOED 0.0000000
## ABYSTAND ABYSTAND 0.0000000
The variables PPERSAUT and MKOOPKLA are the most important.
probs.test <- predict(bc, Caravan.test, n.trees = 1000, type = "response")
pred.test <- ifelse(probs.test > 0.2, 1, 0)
matriz <- confusionMatrix(as.factor(pred.test), as.factor(Caravan.test$Purchase))
matriz## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4495 277
## 1 38 12
##
## Accuracy : 0.9347
## 95% CI : (0.9273, 0.9415)
## No Information Rate : 0.9401
## P-Value [Acc > NIR] : 0.9446
##
## Kappa : 0.0541
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.99162
## Specificity : 0.04152
## Pos Pred Value : 0.94195
## Neg Pred Value : 0.24000
## Prevalence : 0.94007
## Detection Rate : 0.93219
## Detection Prevalence : 0.98963
## Balanced Accuracy : 0.51657
##
## 'Positive' Class : 0
##
Among the people the model predicts will buy, only 12 of 50 (24%) actually do, and it identifies just 12 of the 289 actual buyers (4.15%).
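A more informative way to read this matrix is to compute precision and recall for the buyers (class 1) directly from the counts above:

```r
# Counts from the boosting confusion matrix above (threshold 0.2)
tp <- 12   # predicted buy, actually bought
fp <- 38   # predicted buy, did not buy
fn <- 277  # predicted no-buy, actually bought

precision_buy <- tp / (tp + fp)  # 12/50  = 0.24
recall_buy    <- tp / (tp + fn)  # 12/289 ~ 0.0415
round(c(precision_buy, recall_buy), 4)
```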
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading
pred.test2 <- ifelse(probs.test > 0.2, 1, 0)
matriz <- confusionMatrix(as.factor(pred.test2), as.factor(Caravan.test$Purchase))
matriz## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4495 277
## 1 38 12
##
## Accuracy : 0.9347
## 95% CI : (0.9273, 0.9415)
## No Information Rate : 0.9401
## P-Value [Acc > NIR] : 0.9446
##
## Kappa : 0.0541
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.99162
## Specificity : 0.04152
## Pos Pred Value : 0.94195
## Neg Pred Value : 0.24000
## Prevalence : 0.94007
## Detection Rate : 0.93219
## Detection Prevalence : 0.98963
## Balanced Accuracy : 0.51657
##
## 'Positive' Class : 0
##
Logistic regression yields exactly the same confusion matrix as boosting. This is because `pred.test2` thresholds `probs.test`, the boosting probabilities, instead of the logistic model's predicted probabilities, so the logistic fit is never actually evaluated in this step.
We will use the `Weekly` data set, which contains weekly percentage returns for the S&P 500 stock index between 1990 and 2010, and attempt to predict the variable `Direction`.
train <- sample(nrow(Weekly), nrow(Weekly)*0.7)
Weekly$Direction <- ifelse(Weekly$Direction == "Up", 1, 0)
Weekly.train <- Weekly[train, ]
Weekly.test <- Weekly[-train, ]
bf <- gbm(Direction ~ . - Year - Today, data = Weekly.train, distribution = "bernoulli", n.trees = 5000)
bprobs <- predict(bf, newdata = Weekly.test, n.trees = 5000)
bpred <- ifelse(bprobs > 0.5, 1, 0)
matriz <- confusionMatrix(as.factor(bpred), as.factor(Weekly.test$Direction))
matriz$table## Reference
## Prediction 0 1
## 0 84 100
## 1 54 89
## [1] "Accuracy: 0.5291 Sensitivity: 0.6087 Specificity: 0.4709"
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
bagprobs <- predict(bagf, newdata = Weekly.test)
bagpred <- ifelse(bagprobs > 0.5, 1, 0)
matriz <- confusionMatrix(as.factor(bagpred), as.factor(Weekly.test$Direction))
matriz$table## Reference
## Prediction 0 1
## 0 56 57
## 1 82 132
## [1] "Accuracy: 0.5749 Sensitivity: 0.4058 Specificity: 0.6984"
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
rfprobs <- predict(rff, newdata = Weekly.test)
rfpred <- ifelse(rfprobs > 0.5, 1, 0)
matriz <- confusionMatrix(as.factor(rfpred), as.factor(Weekly.test$Direction))
matriz$table## Reference
## Prediction 0 1
## 0 58 54
## 1 80 135
## [1] "Accuracy: 0.5902 Sensitivity: 0.4203 Specificity: 0.7143"
logf <- glm(Direction ~ . - Year - Today, data = Weekly.train, family = "binomial")
logprobs <- predict(logf, newdata = Weekly.test, type = "response")
logpred <- ifelse(logprobs > 0.5, 1, 0)
matriz <- confusionMatrix(as.factor(logpred), as.factor(Weekly.test$Direction))
matriz$table## Reference
## Prediction 0 1
## 0 7 8
## 1 131 181
## [1] "Accuracy: 0.5749 Sensitivity: 0.0507 Specificity: 0.9577"
lmf <- lm(Direction ~ . - Year - Today, data = Weekly.train)
lmprob <- predict(lmf, Weekly.test)
lmpred <- ifelse(lmprob > 0.5, 1, 0)
matriz <- confusionMatrix(as.factor(lmpred), as.factor(Weekly.test$Direction))
matriz$table## Reference
## Prediction 0 1
## 0 7 8
## 1 131 181
## [1] "Accuracy: 0.5749 Sensitivity: 0.0507 Specificity: 0.9577"
Overall the methods do not differ dramatically, but among those tried the random forest performs best, with an accuracy of 59.02%.
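Collecting the reported accuracies makes the comparison explicit (base R, values copied from the outputs above):

```r
# Test accuracies reported for each model above
acc <- c(boosting = 0.5291, bagging = 0.5749, rf = 0.5902,
         logistic = 0.5749, lm = 0.5749)
names(which.max(acc))  # "rf"
```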
gen <- function(x, y) {
x^2 + y^2 <= 1
}
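The data frame `df` used below comes from an earlier, unshown chunk. A plausible generation step, assuming 100 points drawn uniformly on [-1, 1]^2 (the sample size and range are guesses, not taken from the original code), would be:

```r
set.seed(1)
gen <- function(x, y) x^2 + y^2 <= 1  # inside the unit circle?

n  <- 100                  # assumed sample size
X1 <- runif(n, -1, 1)
X2 <- runif(n, -1, 1)
df <- data.frame(X1, X2)

# Roughly pi/4 (about 79%) of the points should fall inside the circle
table(gen(df$X1, df$X2))
```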
df <- df %>%
rename(Var1 = X1, Var2 = X2) %>%
mutate(Clases = ifelse(gen(Var1, Var2),
'Clase 1',
'Clase 2'),
Clases = factor(Clases))
inTrain <- sample(nrow(df), nrow(df)*0.7)
train <- df[inTrain,]
test <- df[-inTrain,]
theme_set(theme_bw())
ggplot(df, aes(Var1, Var2, col = Clases)) +
geom_point(size = 2)
svm1 <- svm(Clases ~ ., data = train,
kernel = 'linear',
scale = FALSE, cost = 10)
plot(svm1, data = train)## Reference
## Prediction Clase 1 Clase 2
## Clase 1 8 10
## Clase 2 4 8
## [1] "Accuracy: 0.5333 Sensitivity: 0.6667 Specificity: 0.4444"
svm2 <- svm(Clases ~ ., data = train,
kernel = 'polynomial', degree = 2,
scale = FALSE, cost = 1)
plot(svm2, train)## Reference
## Prediction Clase 1 Clase 2
## Clase 1 12 1
## Clase 2 0 17
## [1] "Accuracy: 0.9667 Sensitivity: 1 Specificity: 0.9444"
## Reference
## Prediction Clase 1 Clase 2
## Clase 1 11 2
## Clase 2 1 16
## [1] "Accuracy: 0.9 Sensitivity: 0.9167 Specificity: 0.8889"
As the exercise statement anticipated, both the radial-kernel and the polynomial-kernel SVMs outperform the linear-kernel model. The best performer in this experiment was the degree-2 polynomial SVM, with 96.67% accuracy on the confusion matrix shown above.
df <- data.frame(y,x1,x2)
df <- df %>%
mutate(y = factor(y ))
theme_set(theme_bw())
ggplot(df, aes(x=x1, y=x2, col=y)) +
geom_point()##
## Call:
## glm(formula = y ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3452 -1.0996 -0.9886 1.1824 1.3738
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.038121 0.090234 -0.422 0.67269
## x1 0.008842 0.308455 0.029 0.97713
## x2 0.857039 0.309664 2.768 0.00565 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 692.86 on 499 degrees of freedom
## Residual deviance: 685.07 on 497 degrees of freedom
## AIC: 691.07
##
## Number of Fisher Scoring iterations: 4
probs <- predict(logreg_fit, newdata = train, type = "response")
preds <- rep(0, 500)
preds[probs > 0.5] <- 1
train['preds'] <- preds
theme_set(theme_bw())
train <- train %>%
mutate(preds = factor(preds))
ggplot(train, aes(x=x1, y=x2, col=preds)) +
geom_point()##
## Call:
## glm(formula = y ~ poly(x1, 2) + x2, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2464 -0.7807 -0.5253 0.7573 1.8684
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.05209 0.11049 0.471 0.637
## poly(x1, 2)1 0.16859 2.62143 0.064 0.949
## poly(x1, 2)2 32.04520 3.01261 10.637 <2e-16 ***
## x2 0.88434 0.36713 2.409 0.016 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 692.86 on 499 degrees of freedom
## Residual deviance: 516.82 on 496 degrees of freedom
## AIC: 524.82
##
## Number of Fisher Scoring iterations: 4
probs <- predict(lm.fit, newdata = train, type = "response")
preds2 <- rep(0, 500)
preds2[probs > 0.5] <- 1
theme_set(theme_bw())
train <- train %>%
mutate(preds2 = factor(preds2))
ggplot(train, aes(x=x1, y=x2, col=preds2)) +
geom_point()
theme_set(theme_bw())
train['preds3'] <- preds3
ggplot(train, aes(x=x1, y=x2, col=preds3)) +
geom_point()
svmr <- svm(y ~ x1 + x2, data = train,
kernel = 'radial',
scale = FALSE)
predsr <- predict(svmr, train)
theme_set(theme_bw())
train['predsr'] <- predsr
ggplot(train, aes(x=x1, y=x2, col=predsr)) +
geom_point()
These experiments show that support vector machines with non-linear kernels are very effective on data whose class boundary is non-linear; likewise, when the boundary is linear, linear-kernel SVMs are also very effective.